Abstract

Itemset mining has been an active area of research 
due to its successful application in various data mining 
scenarios including finding association rules. Though 
most of the past work has been on finding frequent 
itemsets, infrequent itemset mining has demonstrated its 
utility in web mining, bioinformatics and other fields. 
In this paper, we propose a new algorithm based on the 
pattern-growth paradigm to find minimally infrequent 
itemsets. A minimally infrequent itemset is an infrequent itemset
none of whose proper subsets is infrequent. We also introduce the novel
concept of residual trees. We further utilize the residual 
trees to mine multiple level minimum support itemsets 
where different thresholds are used for finding frequent 
itemsets for different lengths of the itemset. Finally, we 
analyze the behavior of our algorithm with respect to 
different parameters and show through experiments that 
it outperforms the competing ones. 

Keywords: Itemset Mining, Minimal Infrequent Item- 
sets, Residual Tree, Projected Tree. 

1 Introduction 

Mining frequent itemsets has found extensive utilization 
in various data mining applications including consumer 
market-basket analysis [1], inference of patterns from web
page access logs [?], and iceberg-cube computation [?].
Extensive research has, therefore, been conducted in finding
efficient algorithms for frequent itemset mining, especially
in finding association rules [3].

However, significantly less attention has been paid to
mining of infrequent itemsets, even though it has important
uses in (i) mining of negative association rules from
infrequent itemsets [12], (ii) statistical disclosure risk
assessment, where rare patterns in anonymous census data can
lead to statistical disclosure [7], (iii) fraud detection, where
rare patterns in financial or tax data may suggest unusual
activity associated with fraudulent behavior [?], and
(iv) bioinformatics, where rare patterns in microarray data may
suggest genetic disorders [7].

The large body of frequent itemset mining algorithms 
can be broadly classified into two categories: (i) candi- 
date generation-and-test paradigm and (ii) pattern-growth 
paradigm. In earlier studies, it has been shown experi- 
mentally that pattern-growth based algorithms are com- 
putationally faster on dense datasets. 

Hence, in this paper, we leverage the pattern-growth 
paradigm to propose an algorithm IFP_min for mining 
minimally infrequent itemsets. For some datasets, the set 
of infrequent itemsets can be exponentially large. Report- 
ing an infrequent itemset which has an infrequent proper 
subset is redundant, since the former can be deduced from 
the latter. Hence, it is essential to report only the mini- 
mally infrequent itemsets. 

Haglin et al. proposed an algorithm, MINIT, to mine
minimally infrequent itemsets [7]. It generated all potential
candidate minimal infrequent itemsets using a ranking
order of the items based on their supports and then validated
them against the entire database.

Instead, our proposed IFP_min algorithm mines minimally
infrequent itemsets by recursively partitioning the dataset
into two parts, one that contains a particular item and one
that does not.

If the support threshold is too high, then fewer frequent
itemsets will be generated, resulting in the loss of valuable
association rules. On the other hand, when the support
threshold is too low, a large number of frequent itemsets and,
consequently, a large number of association rules are generated,
thereby making it difficult for the user to choose the
important ones. Part of the problem lies in the fact that a
single threshold is used for generating frequent itemsets
irrespective of the length of the itemset. To alleviate this
problem, the Multiple Level Minimum Support (MLMS) model was
proposed [5], where separate thresholds are assigned to itemsets
of different sizes in order to constrain the number of frequent
itemsets mined. This model finds extensive applications in
market basket analysis for optimizing the number of association
rules generated. We extend our IFP_min algorithm to the MLMS
framework as well.

Tid   Transactions
T1    F, E
T2    A, B, C
T3    A, B
T4    A, D
T5    A, C, D
T6    B, C, D
T7    E, B
T8    E, C
T9    E, D

Table 1: Example database for infrequent itemset mining.

In summary, we make the following contributions: 

• We propose a new algorithm IFP_min for mining
minimally infrequent itemsets. To the best of our
knowledge, this is the first such algorithm based on
the pattern-growth paradigm (Section 5).

• We introduce the concept of residual trees using a
variant of the FP-tree structure termed as inverse
FP-tree (Section 4).

• We propose an optimization on the Apriori algorithm
to mine minimally infrequent itemsets (Section 5.4).

• We present a detailed study to quantify the impact of
variation in the density of datasets on the computation
time of Apriori, MINIT and our algorithm.

• We extend the proposed algorithm to mine frequent
itemsets in the MLMS framework (Section 6).



1.1 Problem Specification 

Consider $I = \{x_1, x_2, \ldots, x_n\}$ to be a set of items. An
itemset $X \subseteq I$ is a subset of items. If its length or
cardinality is $k$, it is referred to as a k-itemset. A transaction
$T$ is a tuple $(tid, X)$ where $tid$ is the transaction identifier
and $X$ is an itemset. It is said to contain an itemset $Y$ if
and only if $Y \subseteq X$. A transaction database $TD$ is simply
a set of transactions.



Each itemset has an associated statistical measure
called support. For an itemset $X$, $supp(X, TD) = X.count$,
where $X.count$ is the number of transactions in $TD$ that
contain $X$. For a user-defined threshold $\sigma$, an itemset is
frequent if and only if its support is greater than or equal
to $\sigma$. It is infrequent otherwise.

As mentioned earlier, the number of infrequent itemsets
for a particular database may be quite large. It may
be impractical to generate and report all of them. A key
observation here is that if an itemset is infrequent, so
are all its supersets. Thus, it makes sense to generate
only minimal infrequent itemsets, i.e., those which are
infrequent but all of whose proper subsets are frequent.

Definition 1 (Minimally Infrequent Itemset). An itemset
$X$ is said to be minimally infrequent for a support threshold
$\sigma$ if it is infrequent and all its proper subsets are
frequent, i.e., $supp(X) < \sigma$ and $\forall Y \subset X$,
$supp(Y) \geq \sigma$.

Given a particular support threshold, our goal is to
efficiently generate all the minimally infrequent itemsets
(MIIs) using the pattern-growth paradigm.

Consider the example transaction database shown in
Table 1. If $\sigma = 2$, {B, D} is one of the minimally
infrequent itemsets for the transaction database. All its proper
subsets, i.e., {B} and {D}, are frequent, but it itself is
infrequent as its support is 1. The whole set of MIIs for the
transaction database is {{E, B}, {E, C}, {E, D}, {B, D},
{A, B, C}, {A, C, D}, {A, E}, {F}}. Note that {B, F}
is not an MII since one of its subsets, {F}, is infrequent as
well.
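
The definition can be checked directly on Table 1 by exhaustive enumeration. The following is a minimal sketch, not part of the proposed algorithm; the function names `support` and `brute_force_miis` are illustrative assumptions, and the enumeration is exponential in the number of items, so it is only suitable for tiny examples.

```python
from itertools import combinations


def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def brute_force_miis(db, sigma):
    """Return all minimally infrequent itemsets of db for threshold sigma."""
    items = sorted(set().union(*db))
    miis = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if support(x, db) >= sigma:
                continue  # frequent, hence not an MII
            # Support can only shrink when items are added, so it is enough
            # to check the subsets obtained by dropping a single item.
            if all(support(x - {i}, db) >= sigma for i in x):
                miis.add(x)
    return miis


# Table 1, one string of item ids per transaction (T1 to T9).
TABLE_1 = [frozenset(t) for t in
           ("FE", "ABC", "AB", "AD", "ACD", "BCD", "EB", "EC", "ED")]

miis = brute_force_miis(TABLE_1, sigma=2)
for m in sorted(miis, key=lambda s: (len(s), sorted(s))):
    print(sorted(m))
```

For $\sigma = 2$ this enumeration yields the eight itemsets listed above.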

In the MLMS framework, the problem is to find all
frequent (equivalently, infrequent) itemsets with different
support thresholds assigned to itemsets of different
lengths. We define $\sigma_k$ as the minimum support threshold
for a k-itemset ($k = 1, 2, \ldots, n$) to be frequent. A
k-itemset $X$ is frequent if and only if $supp(X, TD) \geq \sigma_k$.

For efficient processing of MLMS itemsets, it is useful
to sort the list of items in each transaction in increasing
order of their support counts. This is called the i-flist order.
Also, let $\sigma_{low}$ be the lowest minimum support threshold.

Most applications use the constraint
$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$. This is intuitive
as the support of larger itemsets can only decrease or at most
remain constant. In this case, $\sigma_{low} = \sigma_n$. Our
algorithm IFP_MLMS, however, does not depend on this assumption,
and works for any general $\sigma_1, \ldots, \sigma_n$.

Consider the example transaction database shown in
Table 2. The frequent itemsets corresponding to the different
thresholds are shown in Table 3.

Tid   Transactions     Items in i-flist order
T1    A, C, T, W       A, T, W, C
T2    C, D, W          D, W, C
T3    A, C, T, W       A, T, W, C
T4    A, D, C, W       A, D, W, C
T5    A, T, C, W, D    A, D, T, W, C
T6    C, D, T, B       B, D, T, C

Table 2: Example database for MLMS model.




Threshold         Frequent k-itemsets
$\sigma_1 = 4$    {C}, {W}, {T}, {D}, {A}
$\sigma_2 = 4$    {C, D}, {C, W}, {C, A}, {W, A}, {C, T}
$\sigma_3 = 3$    {C, W, T}, {C, W, D}, {C, W, A}, {C, T, A}, {W, T, A}
$\sigma_4 = 2$    {C, W, T, A}, {C, D, W, A}
$\sigma_5 = 1$    {C, W, T, D, A}

Table 3: Frequent k-itemsets for database in Table 2.
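
The entries of Table 3 can be reproduced from Table 2 by applying the length-specific threshold $\sigma_k$ to every k-itemset. The sketch below does this by exhaustive enumeration; it is only an illustration of the MLMS definition, not the IFP_MLMS algorithm, and the name `mlms_frequent` is an assumption.

```python
from itertools import combinations


def mlms_frequent(db, thresholds):
    """Map each length k to the k-itemsets whose support is >= thresholds[k]."""
    items = sorted(set().union(*db))
    frequent = {}
    for k, sigma_k in thresholds.items():
        frequent[k] = {
            frozenset(c)
            for c in combinations(items, k)
            if sum(1 for t in db if t.issuperset(c)) >= sigma_k
        }
    return frequent


# Table 2, one string of item ids per transaction (T1 to T6).
TABLE_2 = [frozenset(t) for t in
           ("ACTW", "CDW", "ACTW", "ADCW", "ATCWD", "CDTB")]
SIGMA = {1: 4, 2: 4, 3: 3, 4: 2, 5: 1}  # thresholds of Table 3

frequent = mlms_frequent(TABLE_2, SIGMA)
for k in sorted(frequent):
    print(k, sorted("".join(sorted(s)) for s in frequent[k]))
```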



2 Related Work 

The problem of mining frequent itemsets was first
introduced by Agrawal et al. [?], who proposed the Apriori
algorithm. Apriori is a bottom-up, breadth-first search
algorithm that exploits the downward closure property "all
subsets of frequent itemsets are frequent". Only candidate
frequent itemsets whose subsets are all frequent are
generated in each database scan. Apriori needs $l$ database
scans if the size of the largest frequent itemset is $l$. In this
paper, we propose a variation of the Apriori algorithm for
mining minimally infrequent itemsets (MIIs).

In [8], Han et al. introduced a novel algorithm known
as the FP-growth method for mining frequent itemsets.
The FP-growth method is a depth-first search algorithm.
A data structure called the FP-tree is used for storing the
frequency information of itemsets in the original transaction
database in a compressed form. Only two database
scans are needed for the algorithm and no candidate
generation is required. This makes the FP-growth method
much faster than Apriori. In [6], Grahne et al. introduced
a novel FP-array technique that greatly reduces the need
to traverse the FP-trees. In this paper, we use a variation
of the FP-tree for mining the MIIs.

To the best of our knowledge, there has been only one
other work that discusses the mining of MIIs. In [7],
Haglin et al. proposed the algorithm MINIT, which is
based upon the SUDA2 algorithm developed for finding
unique itemsets (itemsets with no unique proper subsets)
[9], [10]. The authors also showed that the minimal
infrequent itemset problem is NP-complete [7].



In [5], Dong et al. proposed the MLMS model for
constraining the number of frequent and infrequent itemsets
generated. A candidate generation-and-test based algorithm
Apriori_MLMS was proposed in [5]. The downward
closure property is absent in the MLMS model, and thus,
the Apriori_MLMS algorithm checks the supports of all
possible k-itemsets occurring at least once in the transaction
database, for finding the frequent itemsets. Generally,
the support thresholds are chosen randomly for different
length itemsets with the constraint
$\sigma_i \geq \sigma_j, \forall i < j$.
In [?], Dong et al. extended their proposed algorithm
from [5] to include an interestingness parameter while
mining frequent and infrequent itemsets.



3 Need for Residual Trees 

By definition, an itemset is a minimally infrequent itemset
(MII) if and only if it is infrequent and all its proper subsets
are frequent. Thus, a trivial algorithm to mine all MIIs would
be to compute all the subsets for every infrequent itemset
and check if they are frequent. This involves finding all
the frequent and infrequent itemsets in the database and
then checking the subsets of infrequent itemsets for
occurrence in the large set of frequent itemsets.
This is a simple but computationally expensive algorithm.

The use of residual trees reduces the computation time.
A residual tree for a particular item is a tree representation
of the residual database corresponding to the item, i.e.,
the entire database with the item removed. We show later
that an MII found in the residual tree is an MII of the entire
transaction database.

The projected database, on the other hand, corresponds
to the set of transactions that contain a particular item.
A potential minimal infrequent itemset mined from the
projected tree must not have any infrequent subset. The
mined itemset is itself one such subset, since the actual
candidate is its union with the item whose projected tree is
under consideration. As we show later, only the support of
this itemset needs to be computed from the corresponding
residual tree.

In this paper, our proposed algorithm IFP_min uses a
structure similar to the FP-tree [8] called the IFP-tree.
This is due to the fact that the IFP-tree provides a more 
visually simplified version of the residual and projected 
trees that leads to enhanced understanding of the algo- 
rithm. A similar algorithm FP_min can be designed that 
uses the FP-tree. The time complexity remains the same. 
In the next section, we describe in detail the IFP-tree and 
the corresponding structures, projected tree and residual 
tree. 

Figure 1: IFP-tree corresponding to the transaction database in
Table 1 (T2 to T9).



4 Inverse FP-tree (IFP-tree) 

The inverse FP-tree (IFP-tree) is a variant of the FP-tree [8].
It is a compressed representation of the whole
transaction database. Every path from the root to a node 
represents a transaction. The root has an empty label. Ex- 
cept the root node, each node of the tree contains four 
fields: (i) item id, (ii) count, (iii) list of child links, and 
(iv) a node link. The item id field contains the identifier 
of the item. The count field at each node stores the sup- 
port count of the path from the root to that node. The list 
of child links point to the children of the node. The node 
link field points to another node with the same item id that 
is present in some other branch of the tree. 

An item header table is used to facilitate the traversal 
of the tree. The header table consists of items and a link 
field associated with each item that points to the first oc- 
currence of the item in the IFP-tree. All link entries of 
the header table are initially set to null. Whenever an 
item is added into the tree, the corresponding link entry 
of the header table is updated. Items in each transaction 
are sorted according to their order in i-flist. 

For inserting a transaction, a path that shares the same 
prefix is searched. If there exists such a path, then the 
count of the common prefix is incremented by one in the 
tree and the remaining items of the transaction (which do 
not share the path) are attached from the last node with 
their count value set to 1. If items of a transaction do 
not share any path in the tree, then they are attached from 
the root. The IFP-tree for the sample transaction database
in Table 1 (considering only the transactions T2 to T9) is
shown in Figure 1.
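
The insertion procedure above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the names `Node`, `build_ifp_tree` and `chain` are assumptions, ties in the i-flist order are broken by item id (also an assumption, so the tree shape may differ from Figure 1), and the input is Table 1 after the infrequent item F has been removed.

```python
from collections import Counter


class Node:
    """IFP-tree node with the four fields described in the text."""

    def __init__(self, item=None):
        self.item = item        # (i) item id; None for the root
        self.count = 0          # (ii) support count of the path to this node
        self.children = {}      # (iii) child links, keyed by item id
        self.node_link = None   # (iv) next node carrying the same item id


def build_ifp_tree(db):
    counts = Counter(item for t in db for item in t)
    # i-flist: increasing support; ties broken by item id (assumption).
    i_flist = sorted(counts, key=lambda item: (counts[item], item))
    rank = {item: r for r, item in enumerate(i_flist)}
    root, header, last = Node(), {}, {}
    for t in db:
        node = root
        for item in sorted(t, key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                # No shared prefix from here on: attach a new node.
                child = Node(item)
                node.children[item] = child
                if item in last:
                    last[item].node_link = child
                else:
                    header[item] = child  # first occurrence of the item
                last[item] = child
            child.count += 1
            node = child
    return root, header, i_flist


def chain(header, item):
    """Follow the node links of an item, starting from the header table."""
    node = header.get(item)
    while node is not None:
        yield node
        node = node.node_link


# Table 1 with the infrequent item F removed (T1 becomes {E}).
DB = [frozenset(t) for t in
      ("E", "ABC", "AB", "AD", "ACD", "BCD", "EB", "EC", "ED")]
root, header, i_flist = build_ifp_tree(DB)
```

Summing the counts along the node-link chain of an item recovers its support, which is what the header table is used for during mining.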

The IFP_min algorithm recursively mines the minimally
infrequent itemsets (MIIs) by dividing the IFP-tree
into two sub-trees: projected tree and residual tree. The
next two sections describe them.



4.1 Projected Tree 

Let $TD$ be the transaction database represented by
the IFP-tree $T$, and let $TD_x$ denote the database of
transactions that contain the item $x$. The projected database
corresponding to the item $x$ is the database of these
transactions $TD_x$, but after removing the item $x$ from the
transactions. The IFP-tree corresponding to this database is
called the projected tree $T_{P_x}$ of item $x$ in $T$. Figure 2a
shows the projected tree of item A, which is the least
frequent item in Figure 1.

The IFP_min algorithm considers the projected tree of
only the least frequent item (lf-item). Henceforth, for
simplicity, we associate every IFP-tree with only a single
projected tree, which is that of the lf-item. Moreover, since the
items are sorted in the i-flist order, there would be a single
node of the lf-item $x$ in the IFP-tree. Thus, the projected
tree of $x$ can be obtained directly from the IFP-tree by
considering the subtree rooted at node $x$.



4.2 Residual Tree 

The residual database corresponding to an item $x$ is the
database of transactions obtained by removing item $x$
from $TD$. The IFP-tree corresponding to this database is
called the residual tree $T_{R_x}$ of item $x$ in $T$. Figure 2b shows
the residual tree of the least frequent item A. It is
obtained by deleting the node corresponding to item A and
then merging the subtree below that node into the main
tree at appropriate positions.

Similar to the projected tree, the IFP_min algorithm
considers the residual tree of only the lf-item. Since there 
is only a single node of the lf-item x in the IFP-tree, the 
residual tree of x can be obtained directly from the IFP- 
tree by deleting the node x from the tree and then merging 
the subtree below the node x with the rest of the tree. 

Furthermore, the projected and residual trees of the next
item (i.e., E) in the i-flist are associated with the residual
tree of the current item (i.e., A). Figure 3 shows the
projected and residual trees of the item E for the tree $T_{R_A}$.
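
At the level of the underlying databases, the two constructions of this section reduce to simple filters. The sketch below works on lists of transactions rather than on IFP-trees, so it illustrates the definitions only; the function names are assumptions, and the input is Table 1 with the infrequent item F removed.

```python
def projected_db(db, x):
    """Transactions that contain x, with x removed from each of them."""
    return [t - {x} for t in db if x in t]


def residual_db(db, x):
    """The entire database, with x removed from every transaction."""
    return [t - {x} for t in db]


def support(itemset, db):
    return sum(1 for t in db if itemset <= t)


# Table 1 with the infrequent item F removed (T1 becomes {E}).
DB = [frozenset(t) for t in
      ("E", "ABC", "AB", "AD", "ACD", "BCD", "EB", "EC", "ED")]

proj_a = projected_db(DB, "A")   # database behind the projected tree of A
res_a = residual_db(DB, "A")     # database behind the residual tree of A
print([sorted(t) for t in proj_a])
print([sorted(t) for t in res_a])
```

The residual database keeps every transaction, so the support of any itemset that does not contain A is unchanged, while the support of an itemset in the projected database equals the support of its union with A in the original database.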



Figure 2: (a) Projected tree $T_{P_A}$ and (b) residual tree $T_{R_A}$
of item A for the IFP-tree shown in Figure 1.

Figure 3: (a) Projected tree and (b) residual tree of item E in
the residual tree $T_{R_A}$ shown in Figure 2b.

5 Mining Minimally Infrequent Itemsets

In this section, we describe the IFP_min algorithm, which
uses a recursive approach to mine minimally infrequent
itemsets (MIIs). The steps of the algorithm are shown in
Algorithm 1.

Algorithm 1 IFP_min

Input: $T$ is an IFP-tree
Output: Minimally infrequent itemsets (MIIs) in $db(T)$
 1: $infreq(T) \leftarrow$ infrequent 1-itemsets
 2: $T \leftarrow$ IFP-tree($db(T) - infreq(T)$)
 3: $x \leftarrow$ least frequent item in $T$ (the lf-item)
 4: if $T$ has a single node then
 5:   if $\{x\}$ is infrequent in $T$ then
 6:     return $\{x\}$
 7:   else
 8:     return $\emptyset$
 9:   end if
10: end if
11: $T_{P_x} \leftarrow$ projected tree of $x$
12: $T_{R_x} \leftarrow$ residual tree of $x$
13: $S_R \leftarrow$ IFP_min(residual tree $T_{R_x}$)
14: $S_{R_{infreq}} \leftarrow S_R$
15: $S_P \leftarrow$ IFP_min(projected tree $T_{P_x}$)
16: $S_{P_{infreq}} \leftarrow \{x\} \bullet (S_P - S_R)$
17: $S_2(x) \leftarrow \{x\} \bullet (items(T_{R_x}) - items(T_{P_x}))$
18: return $S_{R_{infreq}} \cup S_{P_{infreq}} \cup S_2(x) \cup infreq(T)$

The algorithm uses a dot-operation ($\bullet$) to unify sets.
The unification of the first set with the second
produces another set whose $i$-th element is the union of the
entire first set with the $i$-th element of the second set.
Mathematically, $\{x\} \bullet \{S_1, \ldots, S_n\} =
\{\{x\} \cup S_1, \ldots, \{x\} \cup S_n\}$ and
$\{x\} \bullet \emptyset = \emptyset$.

The infrequent 1-itemsets are trivial MIIs and so they
are reported and pruned from the database in Step 1. After
this step, all the items present in the modified database
are individually frequent. IFP_min then selects the least
frequent item (lf-item, Step 3) and divides the database
into two non-disjoint sets: projected database and residual
database of the lf-item (Step 11 and Step 12 respectively).

The IFP_min algorithm is then applied to the residual
database in Step 13 and the corresponding MIIs are
reported in Step 14. In the base case (Step 4 to Step 10),
when the residual database consists of a single item, the
MII reported is either the item itself or the empty set,
according to whether the item is infrequent or frequent.

After processing the residual database, the algorithm
mines the projected database in Step 15. The itemsets in
the projected database share the lf-item as a prefix. The
MIIs obtained from the projected database by recursively
applying the algorithm are compared with those obtained
from the residual database. If an itemset is found to occur
in the latter set, it is not reported; otherwise, the lf-item
is included in the itemset and it is reported as an MII of
the original database (Step 16).

IFP_min also reports the 2-itemsets consisting of the
lf-item and frequent items not present in the projected
database of the lf-item (Step 17). These 2-itemsets have
support zero in the actual database and hence also qualify
as MIIs.
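
The recursion can be sketched at the level of transaction databases as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it operates on lists of item sets instead of IFP-trees, the name `ifp_min` and the tie-breaking rule for the lf-item are assumptions, the single-node base case is subsumed by the check for an empty set of frequent items, and $\sigma \geq 1$ is assumed.

```python
from collections import Counter


def ifp_min(db, sigma):
    """Return the MIIs of db (a list of frozensets) for threshold sigma >= 1."""
    counts = Counter(item for t in db for item in t)
    # Step 1: infrequent 1-itemsets are trivially MIIs.
    infreq = {frozenset([i]) for i, c in counts.items() if c < sigma}
    frequent_items = {i for i, c in counts.items() if c >= sigma}
    if not frequent_items:
        return infreq  # covers the base case: nothing left to partition
    # Step 2: prune the infrequent items from every transaction.
    db = [t & frequent_items for t in db]
    # Step 3: lf-item; ties broken by item id (assumption).
    x = min(frequent_items, key=lambda i: (counts[i], i))
    # Steps 11-12: projected and residual databases of x.
    projected = [t - {x} for t in db if x in t]
    residual = [t - {x} for t in db]
    # Steps 13-15: recurse on both parts.
    s_r = ifp_min(residual, sigma)
    s_p = ifp_min(projected, sigma)
    # Step 16: dot-operation of {x} with the MIIs found only in the projection.
    s_p_infreq = {s | {x} for s in s_p - s_r}
    # Step 17: frequent items that never co-occur with x.
    items_r = set().union(*residual)
    items_p = set().union(*projected)
    s_2 = {frozenset([x, y]) for y in items_r - items_p}
    # Step 18.
    return s_r | s_p_infreq | s_2 | infreq


# Table 1, one string of item ids per transaction (T1 to T9).
TABLE_1 = [frozenset(t) for t in
           ("FE", "ABC", "AB", "AD", "ACD", "BCD", "EB", "EC", "ED")]

result = ifp_min(TABLE_1, sigma=2)
for m in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(m))
```

Working on plain lists makes each step easy to compare against Algorithm 1, at the cost of the compression that the IFP-tree provides.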

5.1 Example 

Consider the transaction database TD shown in Table 1.
Figure 4 shows the recursive partitioning of the tree $T$
corresponding to TD. The box associated with each tree
represents the MIIs in that tree for $\sigma = 2$.

The algorithm starts by separating the infrequent items
from the database. This results in the removal of the item F
(which is inherently an MII). The lf-item in the modified
tree $T$ is A. The algorithm then constructs the projected
tree $T_{P_A}$ and the residual tree $T_{R_A}$ corresponding to item
A and recursively processes them to yield MIIs containing
A and MIIs not containing A, respectively.

• MIIs not containing A (itemsets obtained from $T_{R_A}$)

The lf-item in $T_{R_A}$ is E. Therefore, similar to the
first step, $T_{R_A}$ is again divided into the projected tree
$T_{P_E}$ and the residual tree $T_{R_E}$, which are then
recursively mined.

o MIIs not containing E (itemsets obtained from $T_{R_E}$)

Every 1-itemset is frequent. By recursively
processing the residual and projected trees
of B, {B, D} is obtained as an MII. Itemset
{C, D} is also obtained as a potential MII.
However, since it is frequent, it is not returned.
Since {B, C} is also frequent, only {B, D} is
returned from this tree.

o MIIs containing E (itemsets obtained from $T_{P_E}$)

All the 1-itemsets {B}, {C}, {D} are infrequent
(Step 1 of the algorithm). E is included with
these itemsets to return {E, B}, {E, C}, {E, D}.

$T_{R_E}$ and $T_{P_E}$ are mutually exclusive. Hence, the
combined set {{B, D}, {E, B}, {E, C}, {E, D}}
forms the MIIs not containing A.

• MIIs containing A (itemsets obtained from $T_{P_A}$)

Item {E} is infrequent (support = 0). Hence,
{A, E} forms an MII. Similarly, itemset {B, D}
with support = 0 is also obtained as a potential MII.
The other MIIs obtained from recursive processing
are {B, C}, {C, D}. The itemset {B, D}, however,
appears as an MII in both $T_{P_A}$ and $T_{R_A}$. Hence, it
is removed (Step 16 of the algorithm). This avoids
the inclusion of the itemset {A, B, D}, which is not
an MII since {B, D} is infrequent (as shown in $T_{R_A}$).
A is included with the remaining set of MIIs from
$T_{P_A}$ to form the itemsets {A, B, C}, {A, C, D}.

The combined set {{A, B, C}, {A, C, D}, {A, E}}
thus forms the MIIs containing A.

As mentioned in Step 1 of the algorithm, the infrequent
1-itemset {F} is also included. Hence, in all, the
algorithm returns the set {{B, D}, {E, B}, {E, C}, {E, D},
{A, B, C}, {A, C, D}, {A, E}, {F}} as the MIIs for the
database TD.



5.2 Completeness and Soundness 

In this section, we formally prove that our algorithm is
complete and sound, i.e., it returns all minimally
infrequent itemsets and all itemsets returned by it are
minimally infrequent.

Consider the lf-item $x$ in the transaction database. MIIs
in the database can be exhaustively divided into the
following sets:

• Group 1: MIIs not containing $x$
These are again of two kinds:

(a) Itemsets of length 1:
These constitute the infrequent items in the
database obtained from the initial pruning.

(b) Itemsets of length greater than 1:
These constitute the minimally infrequent
itemsets obtained from the residual tree of $x$.

• Group 2: MIIs containing $x$
This consists of itemsets of the form $\{x\} \cup S$ where
$S$ can be of the following two kinds:

(a) $S$ occurs with $x$ in at least one transaction:
All items in $S$ occur in the projected tree of $x$.

(b) $S$ does not occur with $x$ in any transaction:
Note that the path corresponding to $S$ does not
exist in the projected tree of $x$. Also, $S$ is a
1-itemset. Assume the contrary, i.e., $S$ contains
two or more items. A subset of $S$ would
then exist, which would not occur with $x$ in any
transaction. As a result, the union of $\{x\}$ with
this subset would also be infrequent and
$\{x\} \cup S$ would not qualify as an MII. Thus, $S$
is a node that is absent in the projected tree of $x$.

The following set of observations and theorems prove 
the correctness of the IFP_min algorithm. They exhaus- 
tively define and verify the set of itemsets returned by the 
algorithm. 

The first observation relates the (in)frequent itemsets of 
the database to those present in the residual database. 

Observation 1. An itemset $S$ (not containing the item $x$)
is frequent (infrequent) in the residual tree $T_{R_x}$ if and only
if the itemset $S$ is frequent (infrequent) in $T$, i.e.,

$S$ is frequent in $T$ $\Leftrightarrow$ $S$ is frequent in $T_{R_x}$ ($x \notin S$)
$S$ is infrequent in $T$ $\Leftrightarrow$ $S$ is infrequent in $T_{R_x}$ ($x \notin S$)

Figure 4: Running example for the transaction database in Table 1.



Proof. Since $S$ does not contain $x$, every transaction of $T$
that contains $S$ still contains $S$ after $x$ is removed, and
removing $x$ creates no new occurrence of $S$. All occurrences
of $S$ are, therefore, preserved in the residual tree $T_{R_x}$, i.e.,

$supp(S, T) = supp(S, T_{R_x})$    (1)

Hence, $S$ is frequent (infrequent) in $T$ if and only if $S$ is
frequent (infrequent) in $T_{R_x}$. □

The following theorem shows that the MIIs that do not
contain the item $x$ can be obtained directly as MIIs from
the residual tree of $x$.

Theorem 1. An itemset $S$ (not containing the item $x$) is
minimally infrequent in $T$ if and only if the itemset $S$ is
minimally infrequent in the residual tree $T_{R_x}$ of $x$, i.e.,

$S$ is minimally infrequent in $T$ $\Leftrightarrow$
$S$ is minimally infrequent in $T_{R_x}$ ($x \notin S$)

Proof. Suppose $S$ is minimally infrequent in $T$. Therefore,
it is itself infrequent, but all its subsets $S' \subset S$ are
frequent.

As $S$ does not contain $x$, all occurrences of $S$ or any
of its subsets $S' \subset S$ must occur in the residual tree $T_{R_x}$
only. Hence, using Observation 1, in $T_{R_x}$ also, $S$ is
infrequent but all $S' \subset S$ are frequent. Therefore, $S$ is
minimally infrequent in $T_{R_x}$ as well.

The converse is also true, i.e., if $S$ is minimally
infrequent in $T_{R_x}$, since all its occurrences are in $T_{R_x}$ only, it
is globally minimally infrequent (i.e., in $T$) as well. □

Theorem 1 guides the algorithm in mining MIIs not
containing the least frequent item (lf-item) $x$. The
algorithm makes use of this theorem recursively, by reporting
MIIs of residual trees as MIIs of the original tree. For
mining of MIIs that contain $x$, the following observation
and theorem are presented.

The second observation relates the (in)frequent item- 
sets of the database to those present in the projected 
database. 

Observation 2. An itemset $S$ is frequent (infrequent) in
the projected tree $T_{P_x}$ if and only if the itemset obtained
by including $x$ in $S$ (i.e., $\{x\} \cup S$) is frequent (infrequent)
in $T$, i.e.,

$\{x\} \cup S$ is frequent in $T$ $\Leftrightarrow$ $S$ is frequent in $T_{P_x}$
$\{x\} \cup S$ is infrequent in $T$ $\Leftrightarrow$ $S$ is infrequent in $T_{P_x}$

Proof. Consider the itemset $\{x\} \cup S$. All occurrences of
it are only in the projected tree $T_{P_x}$. The projected tree,
however, does not list $x$, and therefore, we have,

$supp(\{x\} \cup S, T) = supp(\{x\} \cup S, T_{P_x}) = supp(S, T_{P_x})$    (2)

Hence, $\{x\} \cup S$ is frequent (infrequent) in $T$ if and only if $S$
is frequent (infrequent) in $T_{P_x}$. □

The next theorem shows that a potential MII obtained
from the projected tree of an item $x$ (by appending
$x$ to it) is an MII provided it is not an MII in the
corresponding residual tree of $x$.

Theorem 2. An itemset $\{x\} \cup S$ is minimally infrequent
in $T$ if and only if the itemset $S$ is minimally infrequent in
the projected tree $T_{P_x}$ but not minimally infrequent in the
residual tree $T_{R_x}$, i.e.,

$\{x\} \cup S$ is minimally infrequent in $T$ $\Leftrightarrow$
$S$ is minimally infrequent in $T_{P_x}$ and
$S$ is not minimally infrequent in $T_{R_x}$

Proof. LHS to RHS:

Suppose $\{x\} \cup S$ is minimally infrequent in $T$. Therefore,
it is itself infrequent, but all its subsets $S' \subset \{x\} \cup S$,
including $S$, are frequent.

Since $S$ is frequent and it does not contain $x$, using
Observation 1, $S$ is frequent in $T_{R_x}$ and is, therefore, not
minimally infrequent in $T_{R_x}$.

From Observation 2, $S$ is infrequent in $T_{P_x}$. Assume
that $S$ is not minimally infrequent in $T_{P_x}$. Since it is
infrequent itself, there must exist a subset $S' \subset S$ which
is infrequent in $T_{P_x}$ as well. Now, consider the itemset
$\{x\} \cup S'$. From Observation 2, it must be infrequent in $T$.
However, since this is a subset of $\{x\} \cup S$, this contradicts
the fact that $\{x\} \cup S$ is minimally infrequent. Therefore,
the assumption that $S$ is not minimally infrequent in $T_{P_x}$
is false.

Together, it shows that if $\{x\} \cup S$ is minimally
infrequent in $T$, then $S$ is minimally infrequent in $T_{P_x}$ but not
in $T_{R_x}$.

RHS to LHS:

Given that $S$ is minimally infrequent in $T_{P_x}$ but not in
$T_{R_x}$, assume that $\{x\} \cup S$ is not minimally infrequent in
$T$. Since $S$ is infrequent in $T_{P_x}$, using Observation 2,
$\{x\} \cup S$ is also infrequent in $T$.

Now, since we have assumed that $\{x\} \cup S$ is not
minimally infrequent in $T$, it must contain a subset
$A \subset \{x\} \cup S$ which is infrequent in $T$ as well.

Suppose $A$ contains $x$, i.e., $x \in A$. Consider the itemset
$B$ such that $A = \{x\} \cup B$. Note that since $A \subset \{x\} \cup S$,
$B \subset S$. Since $A = \{x\} \cup B$ is infrequent in $T$, from
Observation 2, $B$ is infrequent in $T_{P_x}$. However, since
$B \subset S$, this contradicts the fact that $S$ is minimally
infrequent in $T_{P_x}$. Hence, $A$ cannot contain $x$.

Therefore, it must be the case that $x \notin A$. To show that
this leads to a contradiction as well, we first show that
$S$ is frequent in $T$.

Now, if $S$ is minimally infrequent in $T_{P_x}$ but not in $T_{R_x}$
and $\{x\} \cup S$ is not minimally infrequent in $T$, then every
subset $S' \subset S$ is frequent in $T_{P_x}$. Thus, $\forall S' \subset S$, $S'$
is frequent in $T$ as well (since $T_{P_x}$ is only a part of $T$).
Then, if $S$ is infrequent, it must be an MII. However, from
Theorem 1, it becomes an MII in $T_{R_x}$, which is a contradiction.
Therefore, $S$ cannot be infrequent and is, therefore,
frequent in $T$.

Since $A \subset \{x\} \cup S$ but $x \notin A$, we have $A \subseteq S$.
By assumption, $A$ is infrequent in $T$. Using the Apriori
property, $S$ cannot then be frequent, which contradicts what
was shown above. Hence, the original assumption that
$\{x\} \cup S$ is not minimally infrequent in $T$ is false.

Together, we get that if $S$ is minimally infrequent in
$T_{P_x}$ but not in $T_{R_x}$, then $\{x\} \cup S$ is minimally infrequent
in $T$. □

Theorem 2 guides the algorithm in mining MIIs
containing the least frequent item (lf-item) $x$. The algorithm
first obtains the MIIs from its projected tree and then
removes those that are also found in the residual tree. It
thus shows the connection between the two parts of the
database, projected and residual.
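
Theorems 1 and 2 can be checked empirically on Table 1 by comparing exhaustively computed MIIs of the full, projected and residual databases. The sketch below is only a sanity check, not a proof; the names `miis` and `check_theorems` are assumptions, and the items absent from a projected database are treated as items of support zero in it.

```python
from itertools import combinations


def support(itemset, db):
    return sum(1 for t in db if itemset <= t)


def miis(db, sigma, universe):
    """Brute-force MIIs of db over a fixed universe of items."""
    found = set()
    for k in range(1, len(universe) + 1):
        for combo in combinations(sorted(universe), k):
            s = frozenset(combo)
            # Dropping one item at a time covers all proper subsets by
            # anti-monotonicity; for k = 1 this checks the empty itemset.
            if support(s, db) < sigma and all(
                    support(s - {i}, db) >= sigma for i in s):
                found.add(s)
    return found


def check_theorems(db, sigma, x):
    """Return (Theorem 1 holds, Theorem 2 holds) for item x on db."""
    universe = set().union(*db) - {x}
    projected = [t - {x} for t in db if x in t]
    residual = [t - {x} for t in db]
    all_miis = miis(db, sigma, universe | {x})
    mii_p = miis(projected, sigma, universe)
    mii_r = miis(residual, sigma, universe)
    thm1 = {s for s in all_miis if x not in s} == mii_r
    thm2 = ({s for s in all_miis if x in s and len(s) > 1}
            == {s | {x} for s in mii_p - mii_r})
    return thm1, thm2


TABLE_1 = [frozenset(t) for t in
           ("FE", "ABC", "AB", "AD", "ACD", "BCD", "EB", "EC", "ED")]

results = {x: check_theorems(TABLE_1, 2, x) for x in "ABCDE"}
print(results)
```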



5.3 Correctness 

We now formally establish the correctness of the algorithm
IFP_min by showing that the MIIs as enumerated in Group 1 and
Group 2 in Section 5.2 are generated by it.

In Step 1, the algorithm first finds the infrequent items
present in the tree. These 1-itemsets cover Group 1(a).

Consider the least frequent item $x$. In Step 11 and Step
12, the tree $T$ is divided into smaller trees, the projected
tree $T_{P_x}$ and the residual tree $T_{R_x}$.

In Step 13, Group 1(b) MIIs are obtained by the recursive
application of IFP_min on the residual tree $T_{R_x}$.
Theorem 1 proves that these are MIIs in the original dataset.

In Step 15, potential MIIs are obtained by the recursive
application of IFP_min on the projected tree $T_{P_x}$. In Step 16,
the MIIs obtained from $T_{R_x}$ are removed from this set. Combined
with the item $x$, these form the MIIs enumerated as Group
2(a). Theorem 2 proves that these are indeed MIIs in the
original dataset.

The projected database consists of all those transactions
in which $x$ is present. The Group 2(b) MIIs are of length 2
(as shown earlier). Thus, single items that are frequent but
do not appear in the projected tree of $x$, when combined
with $x$, constitute MIIs with a support count of zero. These
items do appear in $T_{R_x}$, though, as they are frequent. Hence,
they are obtained as single items that appear in $T_{R_x}$ but
not in $T_{P_x}$, as shown in Step 17 of the algorithm.
Algorithm 2 IFP_MLMS

Input: IFP-tree $T$ with $\rho_T = p$
Output: Frequent* itemsets of $T$

1: if $T = \emptyset$ then
2:     return $\emptyset$
3: end if
4: $x \leftarrow$ ls-item in $T$
5: if $supp(\{x\}, T) < \sigma_{low}$ then
6:     $S_P \leftarrow \emptyset$
7: else
8:     $T_{P_x} \leftarrow$ projected tree of $x$
9:     $\rho_{T_{P_x}} \leftarrow p + 1$
10:    $S_P \leftarrow$ IFP_MLMS($T_{P_x}$)
11: end if
12: $T_{R_x} \leftarrow$ residual tree of $x$
13: $\rho_{T_{R_x}} \leftarrow p$
14: $S_R \leftarrow$ IFP_MLMS($T_{R_x}$)
15: if $x$ is frequent* in $T$ then
16:     return $(\{x\} \bullet S_P) \cup S_R \cup \{x\}$
17: else
18:     return $(\{x\} \bullet S_P) \cup S_R$
19: end if



The algorithm IFP_min is, hence, complete and exhaustively
generates all minimally infrequent itemsets.
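The recursion argued for above can be sketched compactly. The following is a minimal illustration, not the authors' implementation: it replaces IFP-trees with plain lists of transactions, the function name `ifp_min` and the sample database are assumptions made here, and it assumes a support threshold of at least 1. The comments indicate which group of MIIs each step is meant to produce.

```python
from collections import Counter

def ifp_min(db, minsup):
    """MIIs of db (a list of frozensets), via projected/residual recursion."""
    counts = Counter(i for t in db for i in t)
    if not counts:
        return set()
    # Group 1(a): infrequent items are MIIs; prune them from the database.
    infrequent = {i for i, c in counts.items() if c < minsup}
    result = {frozenset({i}) for i in infrequent}
    db = [t - infrequent for t in db]
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    if not frequent:
        return result
    # Least frequent item; ties are broken by item name (an arbitrary choice).
    x = min(frequent, key=lambda i: (frequent[i], i))
    proj = [t - {x} for t in db if x in t]     # projected database of x
    resid = [t - {x} for t in db]              # residual database of x
    mii_r = ifp_min(resid, minsup)             # Group 1(b), by Theorem 1
    mii_p = ifp_min(proj, minsup)
    result |= mii_r
    result |= {s | {x} for s in mii_p - mii_r}       # Group 2(a), by Theorem 2
    absent = set().union(*resid) - set().union(*proj)
    result |= {frozenset({x, y}) for y in absent}    # Group 2(b)
    return result

# Hypothetical database with support threshold 2.
db = [frozenset(t) for t in
      ("xa", "xa", "xb", "xbd", "acd", "bcd", "acd", "bcd", "abc")]
print(sorted("".join(sorted(s)) for s in ifp_min(db, 2)))
# prints ['ab', 'cx', 'dx']
```

In this run {a, b} comes from the residual side, {x, d} from the projected side after the subtraction, and {x, c} from the zero-support case.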

5.4 MIIs using Apriori

In this section, we show how the Apriori algorithm Qj 
can be improved to mine the MIIs. Consider the iteration
where candidate itemsets of length $l + 1$ are generated
from frequent itemsets of length $l$. From the generated
candidate set, itemsets whose support satisfies the minimum
support threshold are reported as frequent and the
rest are rejected. This rejected set of itemsets constitutes
the MIIs of length $l + 1$. This is attributed to the fact that
for such an itemset, all the subsets are frequent (due to the
candidate generation procedure) while the itemset itself
is infrequent. For experimentation purposes, we label
this algorithm as the Apriori_min algorithm.
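A minimal sketch of this idea follows. It is an illustration under simplifying assumptions (in-memory transaction lists, a naive join for candidate generation, hypothetical function name and data), not the implementation evaluated in the experiments.

```python
def apriori_min(db, minsup):
    """MIIs of db: the candidates that Apriori rejects at each level."""
    def support(s):
        return sum(1 for t in db if s <= t)

    miis, frequent = set(), set()
    for i in sorted(set().union(*db)):
        s = frozenset({i})
        (frequent if support(s) >= minsup else miis).add(s)

    k = 1
    while frequent:
        # Candidates of length k + 1 whose k-subsets are all frequent.
        candidates = set()
        for a in frequent:
            for b in frequent:
                c = a | b
                if len(c) == k + 1 and all((c - {i}) in frequent for i in c):
                    candidates.add(c)
        # Rejected candidates are infrequent with all subsets frequent: MIIs.
        frequent = set()
        for c in candidates:
            (frequent if support(c) >= minsup else miis).add(c)
        k += 1
    return miis

# Hypothetical database with support threshold 2.
db = [frozenset(t) for t in
      ("xa", "xa", "xb", "xbd", "acd", "bcd", "acd", "bcd", "abc")]
print(sorted("".join(sorted(s)) for s in apriori_min(db, 2)))
# prints ['ab', 'cx', 'dx']
```

The loop stops as soon as no frequent itemsets remain at a level, which is the early-termination behavior that benefits this approach on sparse data.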

6 Frequent Itemsets in MLMS Model

In this section, we present our proposed algorithm
IFP_MLMS to mine frequent itemsets in the MLMS
framework. Though most applications use the constraint
$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n$ for the support thresholds
at different lengths, our algorithm (shown in Algorithm 2)
does not depend on it and works for any general
$\sigma_1, \ldots, \sigma_n$.

We use the lowest support $\sigma_{low} = \min_{\forall i} \sigma_i$ in the
algorithm. The algorithm is based on the item with the least
support, which we term the ls-item.

IFP_MLMS is again based on the concepts of residual 
and projected trees. It mines the frequent itemsets by first 
dividing the database into projected and residual trees for 
the ls-item x, and then mining them recursively. We show 
that the frequent itemsets obtained from the residual tree 
are frequent itemsets in the original database as well. 

The itemsets obtained from the projected tree share the
ls-item as a prefix. Hence, the thresholds cannot be applied
directly as the length of the itemset changes. The
prefix accumulates as the algorithm goes deeper into
recursion, and hence, the prefix is tracked at
each recursion level. At any stage of the recursive
processing, if $supp(\text{ls-item}) < \sigma_{low}$, then this item cannot
occur in a frequent itemset of any length (as any superset
of it will not pass the support threshold). Thus, its sub-tree
is pruned, thereby reducing the search space considerably.

To analyze the prefix items corresponding to a tree, the 
following two definitions are required: 

Definition 2 (Prefix-set of a tree). The prefix-set of a tree
is the set of items that need to be included with the itemsets
in the tree, i.e., all the items on which the projections have
been done. For a tree T, it is denoted by $\Delta_T$.

Definition 3 (Prefix-length of a tree). The prefix-length
of a tree is the length of its prefix-set. For a tree T, it is
denoted by $\rho_T$.

For a tree T having $\Delta_T = S$ and $\rho_T = p$, the
corresponding values for the residual tree are $\Delta_{T_R} = S$
and $\rho_{T_R} = p$, while those for the projected tree are
$\Delta_{T_P} = \{x\} \cup S$ and $\rho_{T_P} = p + 1$. For the original
transaction database TD and its tree T, $\Delta_T = \emptyset$ and
$\rho_T = 0$.

For any k-itemset S in a tree T, the original itemset
must include all the items in the prefix-set. Therefore, if
$\rho_T = p$, for S to be frequent, it must satisfy the support
threshold for (k + p)-length itemsets, i.e., $supp(S) \geq \sigma_{k+p}$
(henceforth, $\sigma_{k,p}$ is also used to denote $\sigma_{k+p}$). The
definitions of frequent and infrequent itemsets in a tree T with
a prefix-length $\rho_T = p$ are, thus, modified as follows.

Definition 4 (Frequent* itemset). A k-itemset S is
frequent* in T having $\rho_T = p$ if $supp(S, T) \geq \sigma_{k,p}$.

Definition 5 (Infrequent* itemset). A k-itemset S is
infrequent* in T having $\rho_T = p$ if $supp(S, T) < \sigma_{k,p}$.
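As a small worked illustration of Definitions 4 and 5, the sketch below evaluates the frequent* test for the same itemset under two prefix-lengths. The function names are hypothetical, and treating lengths beyond the last threshold as never frequent is an assumption made here, not something the definitions state.

```python
def sigma(k, p, thresholds):
    """sigma_{k,p} = sigma_{k+p}; thresholds[i - 1] holds sigma_i."""
    n = k + p
    # Assumption: no threshold is defined beyond len(thresholds), so such
    # itemsets are treated as never frequent.
    return thresholds[n - 1] if n <= len(thresholds) else float("inf")

def is_frequent_star(supp, k, p, thresholds):
    """Definition 4: a k-itemset with support supp in a tree of prefix-length p."""
    return supp >= sigma(k, p, thresholds)

thresholds = [4, 4, 3, 2, 1]   # sigma_1 .. sigma_5
# A 2-itemset with support 3 is infrequent* at the root (sigma_2 = 4) ...
print(is_frequent_star(3, 2, 0, thresholds))   # prints False
# ... but frequent* in a tree with prefix-length 1 (sigma_3 = 3).
print(is_frequent_star(3, 2, 1, thresholds))   # prints True
```

The same support count therefore leads to different outcomes depending on how many projections precede the tree, which is why the prefix-length has to be carried through the recursion.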

Figure 5: Frequent itemsets generated from the database represented in Table 2 using the MLMS framework. The box
associated with each tree represents the frequent* itemsets in that tree. (Item list of the initial tree, with support
counts: B: 1, A: 4, D: 4, T: 4, W: 5, C: 6.)

Using these definitions, we explain Algorithm 2 along
with an example shown in Figure 5 for the database in
Table 2. The support thresholds are $\sigma_1 = 4$, $\sigma_2 = 4$,
$\sigma_3 = 3$, $\sigma_4 = 2$, and $\sigma_5 = 1$.

The item with the least support in the initial tree T is B.
Since its support is not below $\sigma_{low}$ (Step 5), the algorithm
extracts the projected tree $T_{P_B}$ and then recursively
processes it (Step 8 to Step 10).

The algorithm then processes the residual tree $T_{R_B}$
recursively (Step 12 to Step 14). For that, it breaks it into
the projected and residual trees of the ls-item there, which
is A. Figure 5 shows all the frequent itemsets mined.

Since B itself is not frequent in T (Step 15), it is not 
returned. The other itemsets are returned (Step 18). 
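The walk-through above can be mirrored in a short sketch of Algorithm 2. This is an illustration under stated assumptions, not the authors' code: plain transaction lists stand in for IFP-trees, the five-transaction database and the three thresholds are invented for the example (they are not the database of Table 2), and itemsets longer than the number of thresholds are assumed never to be frequent.

```python
from collections import Counter

def sigma(k, p, thresholds):
    """sigma_{k,p} = sigma_{k+p}; assumed infinite beyond the last threshold."""
    n = k + p
    return thresholds[n - 1] if n <= len(thresholds) else float("inf")

def ifp_mlms(db, thresholds, p=0):
    """Frequent* itemsets of db (list of frozensets) with prefix-length p."""
    counts = Counter(i for t in db for i in t)
    if not counts:                                    # Steps 1-3
        return set()
    x = min(counts, key=lambda i: (counts[i], i))     # Step 4: ls-item
    if counts[x] < min(thresholds):                   # Steps 5-6: prune
        s_p = set()
    else:                                             # Steps 8-10
        proj = [t - {x} for t in db if x in t]
        s_p = ifp_mlms(proj, thresholds, p + 1)
    resid = [t - {x} for t in db]                     # Steps 12-14
    s_r = ifp_mlms(resid, thresholds, p)
    result = {s | {x} for s in s_p} | s_r             # Steps 16 and 18
    if counts[x] >= sigma(1, p, thresholds):          # Step 15
        result.add(frozenset({x}))
    return result

# Hypothetical database and thresholds sigma_1 = 4, sigma_2 = 2, sigma_3 = 1.
db = [frozenset(t) for t in ("ab", "ab", "abc", "c", "bc")]
print(sorted("".join(sorted(s)) for s in ifp_mlms(db, [4, 2, 1])))
# prints ['ab', 'abc', 'b', 'bc']
```

In this toy run {a, b} is reported although {a} is not, which shows why single-threshold pruning cannot be reused unchanged in the MLMS model.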

6.1 Correctness of IFP_MLMS

Let x be the ls-item in tree T (having $\rho_T = p$) and S be a
k-itemset not containing x.

For computing the frequent* itemsets for a tree T, the
IFP_MLMS algorithm merges the frequent* itemsets obtained
by processing the projected tree and the residual
tree of the ls-item, using the following theorems.

Theorem 3. An itemset $\{x\} \cup S$ is frequent* in T if and
only if S is frequent* in the projected tree $T_{P_x}$ of x, i.e.,

$\{x\} \cup S$ is frequent* in $T$ $\iff$ $S$ is frequent* in $T_{P_x}$

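Proof. Suppose S is a k-itemset.

S is frequent* in $T_{P_x}$
$\iff supp(S, T_{P_x}) \geq \sigma_{k,p+1}$ (since $\rho_{T_{P_x}} = p + 1$)
$\iff supp(\{x\} \cup S, T) \geq \sigma_{k,p+1}$ (by the earlier Observation relating supports in $T$ and $T_{P_x}$)
$\iff supp(\{x\} \cup S, T) \geq \sigma_{k+1,p}$ (both thresholds equal $\sigma_{k+p+1}$)
$\iff \{x\} \cup S$ is frequent* in T (since $\rho_T = p$). □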

Theorem 4. An itemset S (not containing x) is frequent*
in T if and only if S is frequent* in the residual tree $T_{R_x}$ of
x, i.e.,

$S$ is frequent* in $T$ $\iff$ $S$ is frequent* in $T_{R_x}$

Proof. Suppose S is a k-itemset.

S is frequent* in $T_{R_x}$
$\iff supp(S, T_{R_x}) \geq \sigma_{k,p}$ (since $\rho_{T_{R_x}} = p$)
$\iff supp(S, T) \geq \sigma_{k,p}$ (by the earlier Observation relating supports in $T$ and $T_{R_x}$)
$\iff S$ is frequent* in T (since $\rho_T = p$). □

The algorithm IFP_MLMS merges the following frequent*
itemsets: (i) frequent* itemsets obtained by including
x with those returned from the projected tree
(shown to be correct by Theorem 3), (ii) frequent* itemsets
obtained from the residual tree (shown to be correct
by Theorem 4), and (iii) the 1-itemset {x} if it is frequent*
in T. The root of the tree T that represents the entire
database has a null prefix-set, and therefore, zero
prefix-length. Hence, all frequent* itemsets mined from that
tree are the frequent itemsets in the original transaction
database.



7 Experimental Results 

In this section, we report the experimental results of running
our algorithms on different datasets. We first report
the performance of the IFP_min algorithm in comparison
with the Apriori_min and MINIT algorithms, followed
by that of the IFP_MLMS algorithm. The benchmark
datasets have been taken from the Frequent Itemset
Mining Implementations (FIMI) repository
(http://fimi.ua.ac.be/data/). All experiments were run on a
machine with a Dual Core Intel Processor running at 2.4GHz
with 8GB of RAM.

7.1 IFP_min

The Accident dataset is characteristic of dense and large
datasets. Figure 6 shows the performance of different
algorithms on the dataset. The IFP_min algorithm
outperforms the MINIT and Apriori_min algorithms by
exponential factors. The Apriori_min algorithm, due to its
inherent property of performing worse than the pattern growth
based algorithms on dense datasets, performs the worst.

The Connect (Figure 7), Mushroom (Figure 8) and
Chess (Figure 9) datasets are characteristic of dense and
small datasets. The Apriori_min algorithm achieves better
reduction on the size of candidate sets. However, when
there exist a large number of frequent itemsets, candidate
generation-and-test methods may suffer from generating
a huge number of candidates and performing several scans
of the database for support-checking, thereby increasing the
computational time. The corresponding computational
times have not been shown in the interest of maintaining
the scale of the graphs presented for IFP_min and MINIT.

Figure 6: Accident Dataset (Transactions: 340183, Items: 468)

Figure 7: Connect Dataset

As can be observed from the figures, dense and small
datasets are characterized by a neutral support threshold
below which the MINIT algorithm performs better than
IFP_min and above which IFP_min performs better than
MINIT. The MINIT algorithm prunes an item based on 
the support threshold and length of the itemset in which 
the item is present (minimum support property JJJ). As 
the support thresholds are reduced, the pruning condition
is activated and reduces the search space.
Above the neutral point, the pruning condition is not
effective. In the IFP_min algorithm, any candidate MII
is checked for set membership in a residual database,
whereas in MINIT the candidates are validated by computing
the support from the whole database. Due to the
reduced validation space, IFP_min outperforms MINIT.



Figure 8: Mushroom Dataset

Figure 9: Chess Dataset

Figure 10: T10I4D100K Dataset

The T10I4D100K (Figure 10) and T40I10D100K (Figure 11)
are sparse datasets. Since Apriori_min is a
candidate-generation-and-test based algorithm, it halts
when all the candidates are infrequent. As such, it avoids
the complete traversal of the database for all possible
lengths. However, both IFP_min and MINIT, being based
on the recursive elimination procedure, have to complete
their full run in order to report the MIIs. This results in
higher computational times for these methods.

On analyzing and comparing the MIIs generated by the
IFP_min, Apriori_min and MINIT algorithms, we found
that the MIIs belonging to Group 2(b) (i.e., the itemsets
having zero support in the transaction database) are
not reported by the MINIT algorithm, thereby leading to
its incompleteness. Based on the experimental analysis, it
is observed that for large dense datasets, it is preferable to
use the IFP_min algorithm. For small dense datasets, MINIT
should be used at low support thresholds and IFP_min
should be used at larger thresholds. For sparse datasets,
Apriori_min should be used for reporting MIIs.
Figure 11: T40I10D100K Dataset



7.2 IFP_MLMS

In this section, we report the performance of the IFP_MLMS
algorithm in comparison with the Apriori_MLMS [5]
algorithm. Several real and synthetic datasets (obtained
from http://archive.ics.uci.edu/ml/datasets) have
been used for testing the performance of the algorithms.
For dense datasets, due to the presence of a large number
of transactions and items, the Apriori_MLMS algorithm
crashes for lack of memory space.

The first dataset we used is the Anonymous Microsoft
Web dataset that records the areas of www.microsoft.com
each user visited in a one-week time frame in February
1998 (http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data).
The dataset consists of 131,666 transactions and 294 attributes.



Figure 12: The IFP_MLMS and Apriori_MLMS algorithms on the Anonymous Microsoft Web Dataset.

Figure 13: The IFP_MLMS and Apriori_MLMS algorithms on the Anonymous Microsoft Web Dataset.

Figure 14: The IFP_MLMS and Apriori_MLMS algorithms on the datasets given in Table 4.

Figure 12 plots the performance analysis of the IFP_MLMS
and Apriori_MLMS algorithms. The minimum support
thresholds for itemsets of different lengths were varied
over a distribution window from 2% to 20% at regular
intervals. The graph clearly shows the superiority of
IFP_MLMS as compared to Apriori_MLMS. We also observe
that the time taken by the algorithm to compute the
frequent itemsets is roughly independent of the minimum
support thresholds for different lengths.



For the plot shown in Figure 13, the number of transactions
is varied in the Anonymous Microsoft Web dataset.
We observe that the running time for both the algorithms
increases with the number of transactions when the support
thresholds are kept between 3% and 10%. However,
the rate of increase in time for the Apriori_MLMS algorithm
is much higher and at 80,000 transactions, Apriori_MLMS
crashes due to lack of memory.
Dataset            Number of transactions
machine
vowel_context
abalone
audiology
anneal
cloud
housing
forest_fires
nursery
yeast

Table 4: Details of smaller datasets.



Since Apriori_MLMS fails to give results for large
datasets, smaller datasets were obtained from
http://archive.ics.uci.edu/ml/datasets/ for comparison
with IFP_MLMS. The characteristics of these datasets
are shown in Table 4. For each such dataset, the
corresponding times for IFP_MLMS and Apriori_MLMS are
plotted in Figure 14. The support threshold percentages
are kept the same for all the datasets and were varied
between 10% and 60% for the different length itemsets.

The above results clearly show that the IFP_MLMS
algorithm outperforms Apriori_MLMS. During experimentation,
we found the running time of both the IFP_MLMS and
Apriori_MLMS algorithms to be independent of support
thresholds for the MLMS model. This behavior is attributed
to the absence of the downward closure property [1]
of frequent itemsets in the MLMS model, unlike that of
the single threshold model.

Further, consider an alternative FP-Growth algorithm
for the MLMS model that mines all frequent itemsets
corresponding to the lowest support threshold $\sigma_{low}$ and then
filters the frequent k-itemsets to report the frequent
itemsets. In this case, a very large set of frequent itemsets
is generated, which renders the filtering process
computationally expensive. The pruning based on $\sigma_{low}$ in
IFP_MLMS ensures that the search space is the same for both
the algorithms. Moreover, the filtering required for the
former is implicitly performed in IFP_MLMS, thus making
IFP_MLMS more efficient.
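The two-stage alternative can be illustrated as follows. This is a sketch only: exhaustive enumeration is used as a stand-in for FP-Growth in the first stage, and the function name, database and thresholds are assumptions chosen for the example.

```python
from itertools import combinations

def mine_then_filter(db, thresholds):
    """Stage 1: itemsets with support >= sigma_low. Stage 2: filter by sigma_k."""
    sigma_low = min(thresholds)
    items = sorted(set().union(*db))
    stage1 = {}
    # Brute-force stand-in for FP-Growth run at the lowest threshold.
    for k in range(1, min(len(items), len(thresholds)) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            supp = sum(1 for t in db if s <= t)
            if supp >= sigma_low:
                stage1[s] = supp
    # Keep each k-itemset only if it meets its own length threshold sigma_k.
    kept = {s for s, supp in stage1.items() if supp >= thresholds[len(s) - 1]}
    return stage1, kept

# Hypothetical database and thresholds sigma_1 = 4, sigma_2 = 2, sigma_3 = 1.
db = [frozenset(t) for t in ("ab", "ab", "abc", "c", "bc")]
stage1, kept = mine_then_filter(db, [4, 2, 1])
print(len(stage1), len(kept))   # prints 7 4
```

Even on this tiny example the first stage materializes seven itemsets to report four, which is the extra intermediate output that the paragraph above argues IFP_MLMS avoids.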

8 Conclusions 

In this paper, we have introduced a novel algorithm,
IFP_min, for mining minimally infrequent itemsets
(MIIs). To the best of our knowledge, this is the first paper
that addresses this problem using the pattern-growth
paradigm. We have also proposed an improvement of
the Apriori algorithm to find the MIIs. The existing
algorithms are evaluated on dense as well as sparse
datasets. Experimental results show that: (i) for large
dense datasets, it is preferable to use the IFP_min algorithm,
(ii) for small dense datasets, MINIT should be used at
low support thresholds and IFP_min should be used at
larger thresholds, and (iii) for sparse datasets, Apriori_min
should be used for reporting the MIIs.

We have also designed an extension of the algorithm
for finding frequent itemsets in the multiple level minimum
support (MLMS) model. Experimental results show
that this algorithm, IFP_MLMS, outperforms the existing
candidate-generation-and-test based Apriori_MLMS
algorithm.

In the future, we plan to utilize the scalable properties of
our algorithm to mine maximally frequent itemsets. It
will also be useful to do a performance analysis of the
IFP-tree on a parallel architecture, as well as to extend the
IFP_min algorithm across different models for itemset mining,
including interestingness measures.